In [8]:
import pandas as pd
import numpy as np

Load the data.


In [9]:
from sklearn.datasets import load_boston

In [10]:
bunch = load_boston()

In [11]:
print(bunch.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [12]:
X, y = pd.DataFrame(data=bunch.data, columns=bunch.feature_names.astype(str)), bunch.target

In [13]:
X.head()


Out[13]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Fix the random number generator seed for reproducibility:


In [14]:
SEED = 22
np.random.seed(SEED)  # call the function; assigning to np.random.seed would overwrite it
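
As a quick check of what seeding gives us, the sketch below seeds NumPy's global generator twice with the same value and compares the draws. The names first and second are illustrative only. scikit-learn functions that receive an explicit random_state (as train_test_split does further down) do not depend on this global seed.

```python
import numpy as np

SEED = 22

# Seeding is a function call; the same seed reproduces the same sequence.
np.random.seed(SEED)
first = np.random.rand(3)

np.random.seed(SEED)
second = np.random.rand(3)

print(np.array_equal(first, second))
```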

Homework!

Split the data into a training set and a held-out set:


In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [17]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


Out[17]:
((404, 13), (404,), (102, 13), (102,))
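
The shapes above follow from test_size=0.2 applied to 506 rows. A minimal sketch on synthetic data of the same shape (X_demo and y_demo are stand-ins, not the housing data) reproduces the sizes and shows that a fixed random_state yields the same split each time:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the housing data (506 x 13).
rng = np.random.RandomState(22)
X_demo = rng.rand(506, 13)
y_demo = rng.rand(506)

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)
print(Xa.shape, Xb.shape)

# Repeating the call with the same random_state gives an identical split.
Xa2, Xb2, ya2, yb2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)
print(np.array_equal(Xb, Xb2))
```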

We will measure quality with the mean squared error (MSE) metric:


In [18]:
from sklearn.metrics import mean_squared_error
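
For reference, MSE is the average of the squared differences between targets and predictions. The toy numbers below are made up for illustration: the errors are 1, 1 and 2, so the squared errors are 1, 1 and 4 and their mean is 2. The sketch computes this by hand and with mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (illustrative values only).
y_true = np.array([24.0, 21.5, 34.5])
y_pred = np.array([25.0, 20.5, 32.5])

manual = np.mean((y_true - y_pred) ** 2)      # average squared error by hand
library = mean_squared_error(y_true, y_pred)  # same quantity from scikit-learn

print(manual, library)
```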

Task 1.

Train LinearRegression from the sklearn.linear_model package on the training set (X_train, y_train) and measure its quality on X_test.

P.S. The error should be around 20.

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

clf = LinearRegression()
clf.fit(X_train, y_train);

# Note: cross_val_score clones and refits the estimator on folds of the test set,
# so the fit on X_train above does not affect this score.
print('Mean error: %5.4f' %
            (-np.mean(cross_val_score(clf, X_test, y_test, cv=5, scoring='neg_mean_squared_error'))))


Mean error: 21.7029
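
A more direct reading of the task is to fit on the training set and score the predictions on the held-out set. The sketch below does this on synthetic linear data (an assumption, so the number it prints has nothing to do with the housing result); holdout_mse and the *_demo names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data with noise (an assumption; not the housing data).
rng = np.random.RandomState(22)
X_demo = rng.rand(506, 13)
y_demo = X_demo @ np.linspace(-3, 3, 13) + rng.normal(scale=0.5, size=506)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)

# Fit on the training part only, then score predictions on the held-out part.
model = LinearRegression().fit(X_tr, y_tr)
holdout_mse = mean_squared_error(y_te, model.predict(X_te))
print('Hold-out MSE: %5.4f' % holdout_mse)
```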

Task 2. (with a catch)

Train SGDRegressor from the sklearn.linear_model package on the training set (X_train, y_train) and measure its quality on X_test.

In [41]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_scaled = ss.fit_transform(X_train)
# Use a separate scaler for the target and pass it as a 2-D column:
# recent scikit-learn versions reject 1-D input to StandardScaler.
y_scaled = StandardScaler().fit_transform(y_train.reshape(-1, 1)).ravel()

sgd = SGDRegressor()
sgd.fit(X_scaled, y_scaled);

# Note: this MSE is in standardized target units (cross-validated on the training set),
# so it is not directly comparable with the error from Task 1.
print('Mean error: %5.4f' %
            (-np.mean(cross_val_score(sgd, X_scaled, y_scaled, cv=5, scoring='neg_mean_squared_error'))))


Mean error: 0.3137
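
One way to keep the scaling and still report the error in the original target units is to wrap the steps in a pipeline. The sketch below is an assumption-laden illustration, not the notebook's solution: it uses synthetic data with features on very different scales, and TransformedTargetRegressor from sklearn.compose (available in newer scikit-learn releases than the one that produced the output above) to standardize the target during fitting and invert the scaling at prediction time.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose features span several orders of magnitude (an assumption).
rng = np.random.RandomState(22)
scales = np.logspace(0, 3, 13)
X_demo = rng.rand(506, 13) * scales
y_demo = X_demo @ (np.linspace(-3, 3, 13) / scales) + rng.normal(scale=0.5, size=506)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)

# Features are standardized inside the pipeline; the target is standardized
# for fitting and mapped back to original units when predicting.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SGDRegressor(random_state=22)),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)

sgd_mse = mean_squared_error(y_te, model.predict(X_te))
print('Hold-out MSE in original units: %5.4f' % sgd_mse)
```

Because the scalers are fitted only on the training part, no statistics from the held-out rows leak into the model.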

Task 3.

Try all the remaining classes:
  • Ridge
  • Lasso
  • ElasticNet

As you already know, they use the regularization parameter alpha. Tune it both with GridSearchCV and with the ready-made -CV classes (RidgeCV, LassoCV, etc.).

Finally, find the most accurate linear model!

In [61]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeCV

# Ridge
params = {
    'alpha': [10**x for x in range(-2, 3)]
}

from sklearn.linear_model import Ridge

gsR = RidgeCV()  # alternative: GridSearchCV(Ridge(), param_grid=params)
gsR.fit(X_train, y_train);

# 5-fold CV on the held-out set (a clone is refitted on each fold)
print('Mean error: %5.4f' %
            (-np.mean(cross_val_score(gsR, X_test, y_test, cv=5, scoring='neg_mean_squared_error'))))


Mean error: 21.3110
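
Note that RidgeCV() called without arguments searches its own default grid, alphas=(0.1, 1.0, 10.0), not the params dictionary defined above. A minimal sketch, on synthetic data (an assumption), of passing the same grid explicitly and reading back the selected value from the alpha_ attribute:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic linear data (an assumption; not the housing data).
rng = np.random.RandomState(22)
X_demo = rng.rand(506, 13)
y_demo = X_demo @ np.linspace(-3, 3, 13) + rng.normal(scale=0.5, size=506)

alphas = [10 ** x for x in range(-2, 3)]  # 0.01, 0.1, 1, 10, 100

# RidgeCV evaluates every alpha by cross-validation and keeps the best one.
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
chosen_alpha = ridge_cv.alpha_
print('Selected alpha:', chosen_alpha)
```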

In [63]:
# Lasso
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV

gsL = GridSearchCV(Lasso(), param_grid=params)  # LassoCV() is slower
gsL.fit(X_train, y_train);

print('Mean error: %5.4f' %
            (-np.mean(cross_val_score(gsL, X_test, y_test, cv=5, scoring='neg_mean_squared_error'))))


Mean error: 21.2454

In [59]:
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import ElasticNetCV

gsE = GridSearchCV(ElasticNet(), param_grid=params)  # ElasticNetCV() is a drop-in replacement, but less accurate here
gsE.fit(X_train, y_train);

print('Mean error: %5.4f' %
            (-np.mean(cross_val_score(gsE, X_test, y_test, cv=5, scoring='neg_mean_squared_error'))))


Mean error: 21.3403

Overall, the most accurate of these three is GridSearchCV + Lasso.
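
The three searches can also be run in one loop. The sketch below is one possible arrangement on synthetic data (an assumption, so its ranking says nothing about the housing result). It sets scoring='neg_mean_squared_error' explicitly, since GridSearchCV otherwise selects by the estimator's own score method (R^2 for these regressors), and it scores the refitted best model on the held-out part.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic linear data (an assumption; not the housing data).
rng = np.random.RandomState(22)
X_demo = rng.rand(506, 13)
y_demo = X_demo @ np.linspace(-3, 3, 13) + rng.normal(scale=0.5, size=506)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)

param_grid = {'alpha': [10 ** x for x in range(-2, 3)]}
results = {}
for name, estimator in [('Ridge', Ridge()), ('Lasso', Lasso()), ('ElasticNet', ElasticNet())]:
    # Select alpha by 5-fold CV on the training part, using MSE as the criterion.
    search = GridSearchCV(estimator, param_grid=param_grid,
                          scoring='neg_mean_squared_error', cv=5)
    search.fit(X_tr, y_tr)
    # search.predict uses the best estimator refitted on the whole training part.
    mse = mean_squared_error(y_te, search.predict(X_te))
    results[name] = (search.best_params_['alpha'], mse)

for name, (alpha, mse) in results.items():
    print('%-10s alpha=%-6g hold-out MSE=%5.4f' % (name, alpha, mse))
```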

Task 4.

As you know, the right way to assess quality is cross-validation. You know what to do: import cross_val_score from sklearn.model_selection. Set the cv parameter to 5.

Recall all the techniques we have learned so far.

Achieve MSE < 27.

Oops! All of the cases above were already evaluated with cross_val_score.
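
For completeness, here is a minimal sketch of the cross-validation recipe the task describes, with cv=5 and scaling kept inside a pipeline so each fold is scaled using its own training rows. The data is synthetic and the choice of Ridge(alpha=1.0) is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic linear data (an assumption; not the housing data).
rng = np.random.RandomState(22)
X_demo = rng.rand(506, 13)
y_demo = X_demo @ np.linspace(-3, 3, 13) + rng.normal(scale=0.5, size=506)

pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# scikit-learn maximizes scores, so MSE is reported negated; flip the sign back.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error')
cv_mse = -scores.mean()
print('5-fold CV MSE: %5.4f' % cv_mse)
```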